LOSSLESS COMPRESSION AND ALPHABET SIZE by DANIEL
Abstract
Lossless data compression, which exploits redundancy in a sequence of symbols, is a well-studied field in computer science and information theory. One way to achieve compression is to model the data statistically and estimate the model parameters. In practice, most general-purpose data compression algorithms model the data as stationary sequences of 8-bit symbols. While this model fits current computer architectures and the vast majority of information representation standards very well, other models may have both computational and information-theoretic merits: they may be more efficient to implement or fit some data more closely. In addition, compression algorithms based on the 8-bit symbol model perform very poorly on data represented by binary sequences that are not aligned with byte boundaries, either because the fixed symbol length is not a multiple of 8 bits (e.g. DNA sequences) or because the symbols of the source are encoded into bit sequences of variable length. Throughout this thesis, we assume that the source alphabet consists of equal-size blocks of elementary symbols (typically bits), and we address the impact of this block size on lossless compression algorithms in general and on so-called block-sorting compression algorithms in particular. These algorithms are popular in both theory and practice and have been the subject of intensive research, with many interesting results in recent years. We show that compression on the bit level is tolerant to sources that are not aligned to byte boundaries, while performing reasonably well for byte-aligned sources.
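The effect of block size on a source that is not byte-aligned can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions, not the method of the thesis: it uses a plain order-0 empirical entropy estimate (far simpler than block-sorting compression), a made-up skewed source of 3-bit symbols, and hypothetical helper names (`bits_per_source_bit`, `make_skewed_3bit_source`).

```python
import math
import random
from collections import Counter


def bits_per_source_bit(bits: str, block_size: int) -> float:
    """Order-0 empirical entropy of `bits` parsed into fixed-size blocks,
    normalised to bits of code per bit of input.
    Trailing bits that do not fill a whole block are ignored."""
    n_blocks = len(bits) // block_size
    blocks = [bits[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    counts = Counter(blocks)
    entropy = -sum(c / n_blocks * math.log2(c / n_blocks) for c in counts.values())
    return entropy / block_size


def make_skewed_3bit_source(n_symbols: int, seed: int = 0) -> str:
    """Hypothetical i.i.d. source of 3-bit symbols with an assumed skewed
    distribution, returned as a string of '0'/'1' characters."""
    rng = random.Random(seed)
    symbols = ["000", "001", "010", "011", "100", "101", "110", "111"]
    weights = [0.6, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01, 0.01]
    return "".join(rng.choices(symbols, weights=weights, k=n_symbols))


if __name__ == "__main__":
    data = make_skewed_3bit_source(80_000)  # 240,000 bits, divisible by 3 and 8
    for block_size in (1, 3, 8):
        rate = bits_per_source_bit(data, block_size)
        print(f"block size {block_size}: {rate:.3f} bits per source bit")
```

For this assumed source, parsing into 3-bit blocks matches the true symbol boundaries, whereas 8-bit blocks straddle them at three different phases, so the byte-level order-0 estimate should come out higher. Context-based bit-level models such as those studied in the thesis are not captured by this order-0 sketch.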
Similar resources
Procedures of extending the alphabet for the PPM algorithm
This paper presents the lossless PPM (Prediction by Partial string Matching) algorithm and studies how the alphabet can be extended for PPM encoding to allow symbols that are not present in the alphabet at the beginning of the encoding phase. The extended alphabet can contain symbols larger than a byte. The paper presents the manner to e...
A Fast and Efficient Nearly-Optimal Adaptive Fano Coding Scheme
Adaptive coding techniques have been increasingly used in lossless data compression. They are suitable for a wide range of applications, in which on-line compression is required, including communications, internet, e-mail, and e-commerce. In this paper, we present an adaptive Fano coding method applicable to binary and multi-symbol code alphabets. We introduce the corresponding partitioning pro...
Recent results in combined coding for word-based PPM
This paper presents the lossless PPM (Prediction by Partial string Matching) algorithm and studies how the extended alphabet can be used in PPM encoding to allow symbols that are not present in the alphabet at the beginning of the encoding phase. The extended alphabet can contain symbols larger than a byte and at the decoding external w...
arXiv:cs/0603068v1 [cs.IT] 17 Mar 2006 Universal Lossless Compression with Unknown Alphabets - The Average
Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alp...
Grammar-based codes: A new class of universal lossless source codes
We investigate a type of lossless source code called a grammar-based code, which, in response to any input data string x over a fixed finite alphabet, selects a context-free grammar G_x representing x in the sense that x is the unique string belonging to the language generated by G_x. Lossless compression of x takes place indirectly via compression of the production rules of the grammar G_x. It is shown that, su...
Publication date: 2006